# Video understanding

Vjepa2 Vitl Fpc64 256
MIT
V-JEPA 2 is a video understanding model developed by the FAIR team at Meta. It extends the pre-training objectives of V-JEPA and is described as having industry-leading video understanding capabilities.
Video Processing Transformers
facebook
109
27
Internvl3 8B Hf
Other
InternVL3 is a series of multimodal large language models with multimodal perception and reasoning capabilities, supporting image, video, and text inputs.
Image-to-Text Transformers Other
OpenGVLab
454
1
Internvl3 2B Hf
Other
InternVL3-2B is a multimodal large language model implemented with the Hugging Face Transformers library. It handles image, video, and text inputs, supports multiple input methods, and offers efficient batch inference.
Image-to-Text Transformers Other
OpenGVLab
41.22k
2
Smolvlm2 2.2B Instruct
Apache-2.0
SmolVLM2-2.2B is a lightweight multimodal model designed for analyzing video content. It can process video, image, and text inputs and generate text outputs.
Image-to-Text Transformers English
HuggingFaceTB
62.56k
164
Xgen Mm Vid Phi3 Mini R V1.5 32tokens 8frames
xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model equipped with an explicit temporal encoder, specifically designed to understand video content.
Video-to-Text Safetensors English
Salesforce
441
3
Videomae Base Finetuned Subset
A video understanding model fine-tuned from MCG-NJU/videomae-base on an unspecified dataset, with a reported accuracy of 67.13%.
Video Processing Transformers
Joy28
2
0